Abstract

In this article I outline the ideas behind the Open Archives Initiative metadata harvesting protocol
(OAIMH), and attempt to clarify some common misconceptions. I then consider how the OAIMH protocol
can be used to expose and harvest metadata. Perl code examples are given as practical illustration.

1 Introduction

The Open Archives Initiative (OAI) [1] announced the OAI metadata harvesting protocol (OAIMH) v1.0 [2]
on 21 January 2001 after a period of pre-release testing. It is intended that the protocol will not be changed
until 12-18 months have elapsed from the initial release. This period of stability is designed to allow time
for thorough evaluation without the cost of multiple rewrites for early implementers.

The OAIMH protocol was designed as a simple, low-barrier way to achieve interoperability through metadata
harvesting. It is still an open question as to exactly how useful metadata sharing will be. However, there is
certainly considerable interest in OAI, and experience with early OAIMH implementations is encouraging.

This tutorial is organized in four main sections. In section 2, I hope to clear up some common misconceptions
about what OAIMH is. In section 3, I review some of the concepts and assumptions that underlie the
OAIMH protocol. Then, in the remaining two sections, sections 4 and 5, I consider implementation of the
data-provider and service-provider sides of the OAIMH protocol. Perl code examples are given to implement
bare-bones versions of these two interfaces.

It is not my intention to offer a complete description of the OAIMH protocol but instead to describe its
use in very practical terms, and to highlight common practice among implementers. A copy of the OAIMH
protocol specification [2] should be at hand while reading this tutorial. I will refer to sections within the
protocol specification as §2.1 (for section 2.1).



2 What OAIMH is not 

The most common misconception about OAIMH, as it currently stands, is that it provides mechanisms to expose
and harvest full content (documents, images, ...). This is not true: OAIMH is a protocol for the exchange
of metadata only. However, it may be that a future OAI protocol will provide facilities for the exchange of
full content.

OAIMH is not about direct interoperability between archives. It is based on a model which puts a very 
clean divide between data-providers (entities which expose metadata) and service-providers (entities which 
harvest metadata, presumably with the intention of providing some service). 

While the model has a clear divide between data-providers and service-providers, there is nothing to say
that one entity cannot be both; CiteBase [3] is one example. The model has an obvious scalability problem
if every service-provider is expected to harvest data from every data-provider. It may be that this is not
an issue if service-providers are specific to a particular community and thus harvest only from a subset of
data-providers. We may also see the creation of aggregators which harvest from a number of data-providers
and then re-export this data.

OAIMH is not limited to Dublin Core (DC) [4] metadata. However, since OAI aims to promote
interoperability, DC metadata has been adopted as a lowest-common-denominator metadata format which all
data-providers should support. It is not intended that the requirement to export DC metadata should
preclude the use of other metadata sets that may be more appropriate within particular communities. The
OAI encourages the development of community-specific standards that provide the functionalities required
by specific communities.



3 OAIMH concepts 

3.1 Pull-only interaction via HTTP using XML 

Service-providers make requests to data-providers; there is no support for data-provider driven interaction. 
All requests and replies occur using the HTTP protocol [5]. Requests may be made using either the HTTP
GET or POST methods. All successful replies are encoded in XML [6], and all exception and flow-control 
replies are indicated by HTTP status codes. 



3.2 Verbs 

OAIMH protocol requests are made using one of six verbs: Identify, GetRecord, ListIdentifiers, ListRecords,
ListSets, and ListMetadataFormats. Some of these verbs accept or require additional parameters to completely
specify the request. The verb and any parameters are specified as key=value pairs (§3.1.1) either in the URL
(if the GET method is used), or in the body of the message (if the POST method is used).
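
As an illustration of the key=value encoding, the following minimal sketch builds a GET request URL from a verb and its parameters. It is not part of the example programs accompanying this tutorial; the subroutine names escape_value and build_request_url are hypothetical, and the sketch assumes byte strings (no wide characters) as input.

```perl
use strict;
use warnings;

# Percent-encode every byte outside the URI "unreserved" set
# (letters, digits, '-', '.', '_' and '~').
sub escape_value {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $value;
}

# Build a GET request URL: the verb first, then any other
# parameters as key=value pairs in sorted order.
sub build_request_url {
    my ($base_url, $verb, %params) = @_;
    my @pairs = ('verb=' . escape_value($verb));
    foreach my $key (sort keys %params) {
        push @pairs, escape_value($key) . '=' . escape_value($params{$key});
    }
    return $base_url . '?' . join('&', @pairs);
}

# Example: a GetRecord request against a locally installed data-provider.
print build_request_url('http://localhost/oai1', 'GetRecord',
    identifier => 'record2', metadataPrefix => 'oai_dc'), "\n";
```

For a POST request the same key=value string (everything after the '?') would be sent in the body of the message instead.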



3.3 Items, records and identifiers 

The OAIMH protocol is based on a model of repositories that hold metadata about items §2. The nature of 
the items is outside the scope of the protocol; they might be electronic documents, artifacts in a museum, 
people, or almost anything else. The requirement for OAI compliance is that the repository be able to 
disseminate metadata for these items in one or more formats including Dublin Core (DC). 

Metadata is disseminated via the GetRecord and ListRecords verbs. These requests result in zero or more
records being returned. A record consists of two or three parts: a <header> container, a <metadata> container,
and possibly an <about> container §2.2.

The metadata for each item has a unique identifier which, when combined with the metadataPrefix, acts
as a key to extract a metadata record. Note that although all metadata types for an item share the same
identifier, the identifier is explicitly not an identifier for the item §2.3. Identifiers may be any valid URI [7]
but an optional OAI identifier syntax §A2 has been adopted widely. The OAI identifier syntax divides the
identifier into three parts separated by colons (:), e.g. oai:arXiv:hep-lat/0008015 where 'oai' is the scheme,
'arXiv' identifies the repository, and 'hep-lat/0008015' is the identifier within the particular repository.
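
A minimal sketch of how a harvester might split an identifier that follows this optional syntax is given below. The subroutine name parse_oai_identifier is hypothetical, and the sketch checks only the three-part structure described above, not the full syntax rules of §A2.

```perl
use strict;
use warnings;

# Split an identifier of the form oai:<repository>:<local identifier>.
# Only the first two colons are treated as separators, so the local
# part may itself contain colons. Returns an empty list if the
# identifier does not have this form.
sub parse_oai_identifier {
    my ($identifier) = @_;
    my ($scheme, $repository, $local) = split /:/, $identifier, 3;
    return unless defined $local
        && $scheme eq 'oai'
        && length $repository
        && length $local;
    return ($scheme, $repository, $local);
}

my @parts = parse_oai_identifier('oai:arXiv:hep-lat/0008015');
print join(' | ', @parts), "\n";
```

Because identifiers may be any valid URI, a harvester should treat a failure to parse as "not in OAI identifier syntax" rather than as an error.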

3.4 Datestamps 

The metadata for an item (considered as a whole, not as individual formats) has a datestamp which is
the date of last modification. The purpose of the datestamp is to support date-selective harvesting and
incremental harvesting in particular. Datestamps are returned in the records returned by a data-provider
and may be used as optional arguments to the ListIdentifiers and ListRecords requests.

The datestamps have the granularity of a day: they are in YYYY-MM-DD format and no time is specified.
This simple date format actually creates some additional complexity because the service-provider and
data-provider may not be in the same time zone. This is considered further in section 5.2.

Typically, a service-provider would initially harvest all metadata records from a repository by issuing 
a ListRecords request without from or until restrictions. Subsequently, the service-provider would issue 
ListRecords requests with a from parameter equal to the date of the last harvest. 

3.5 Sets 

Sets are provided as an optional construct for grouping items to support selective harvesting §2.5. It is not
intended that they should provide a mechanism by which a search is implemented, and there is no controlled
vocabulary for set names so automated interpretation of set structure is not supported. It should be noted
that sets are optional both from the point of view of the data-provider, which may or may not implement
sets, and from that of the service-provider, which may ignore any set structure that is exposed. It is not
clear whether sets will be widely used and I shall not consider them further in this tutorial.

3.6 Metadata formats 

The OAIMH protocol supports multiple parallel metadata formats. Dublin Core (DC) is mandated for lowest
common denominator interoperability. The use of other formats within particular communities or for special
purposes is encouraged. Within a particular repository, metadata formats are identified by a metadataPrefix.
Each metadataPrefix is associated with the URL of the schema which may be used to validate the metadata
records; the URL has cross-repository scope. The only globally meaningful metadataPrefix is oai_dc (for DC),
which is associated with the schema at http://www.openarchives.org/OAI/dc.xsd.

The ListMetadataFormats request will return the metadataPrefix, schema, and optionally a
metadataNamespace, for either a particular record or for the whole repository (if no identifier is specified).
In the case of the whole repository, all metadata formats supported by the repository are returned. It is not
implied that all records are available in all formats.

3.7 Exception conditions 

The OAIMH protocol has very simple exception handling: syntax errors result in HTTP status code 400
replies, and parameters that are invalid or have values that do not match records in the repository result
in empty replies. For example, a ListRecords request for a date range when there were no changes, or for a
metadata format not supported, will result in a reply with header information but no <record> elements.

3.8 Flow control, load balancing and redirection



Flow control is supported with the HTTP retry-after status code 503. This allows a server (data-provider) to
tell the harvesting agent (service-provider) to try the request again after some interval. It is left entirely up
to the server implementer to determine the conditions under which such a response will be given. The server
could base the response on current machine load, or limit the frequency with which requests from any
given IP address will be serviced. The retry-after response may also be used to handle temporary outages
without simply taking the server off-line.

In an environment where one of a set of servers may handle a request, the server may dynamically redirect a
request using the HTTP 302 response. To date this has been implemented only by the NACA repository [8].

4 Exposing metadata 

To expose metadata within the OAI, one must implement the data-provider side of the OAIMH protocol.
This provides a small set of functions which can be used to extract information about and metadata from 
the underlying repository. 

4.1 Minimal server implementation 

The Perl files oai1.pl, OAIServer.pm and Database.pm implement a bare-bones data-provider interface. The
file oai1.pl handles HTTP requests and must be associated with a URL in the web server configuration file;
for the Apache [9] web server, the configuration line is ScriptAlias /oai1 /some/directory/oai1.pl if the code
is in /some/directory. It is also possible to run oai1.pl from the command line; the request is specified with
the -r flag, e.g. ./oai1.pl -r 'verb=Identify'.

The algorithm for oail.pl is simply: 

    read GET, POST or command line request
    check syntax of request
    if syntax correct
        return XML reply to request
    else
        return HTTP 400 error code and message

An example of an invalid request is: 

    simeon@fff>./oai1.pl -r 'bad-request'
    Status: 400 Malformed request
    Content-Type: text/plain

    No verb specified!

OAIServer.pm exports two subroutines, one (OAICheckRequest) to check the request against a grammar stored
in a data structure, and another (OAISatisfyRequest) which calls the appropriate routine to implement the
required OAI verb. I will consider each verb in turn.
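
To illustrate what checking a request against a grammar stored in a data structure might look like, here is a minimal, self-contained sketch. It is not the code of OAICheckRequest: the %GRAMMAR table, the subroutine name check_request and all messages other than 'No verb specified!' are assumptions made for this illustration. The table is deliberately simplified (it omits the resumptionToken argument and does no percent-decoding or date validation), so the protocol specification [2] remains the reference for the real grammar.

```perl
use strict;
use warnings;

# Simplified, illustrative grammar: required and optional arguments
# for each verb. An assumption for this sketch, not a transcription
# of the protocol specification.
my %GRAMMAR = (
    Identify            => { required => [], optional => [] },
    GetRecord           => { required => ['identifier', 'metadataPrefix'], optional => [] },
    ListIdentifiers     => { required => [], optional => ['from', 'until', 'set'] },
    ListRecords         => { required => ['metadataPrefix'], optional => ['from', 'until', 'set'] },
    ListSets            => { required => [], optional => [] },
    ListMetadataFormats => { required => [], optional => ['identifier'] },
);

# Check a key=value&key=value request string against the grammar.
# Returns (200, 'OK') or (400, message).
sub check_request {
    my ($query) = @_;
    my %args;
    foreach my $pair (split /&/, $query) {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $key && length $key;
        $args{$key} = defined $value ? $value : '';
    }
    my $verb = delete $args{verb};
    return (400, 'No verb specified!') unless defined $verb;
    my $rule = $GRAMMAR{$verb};
    return (400, "Unknown verb: $verb") unless $rule;
    foreach my $name (@{ $rule->{required} }) {
        return (400, "Missing argument: $name") unless exists $args{$name};
    }
    my %allowed = map { ($_ => 1) } @{ $rule->{required} }, @{ $rule->{optional} };
    foreach my $name (sort keys %args) {
        return (400, "Illegal argument: $name") unless $allowed{$name};
    }
    return (200, 'OK');
}

my ($status, $message) = check_request('bad-request');
print "Status: $status $message\n";
```

Keeping the grammar in a table means that supporting a new argument is a change to data rather than to the checking logic.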

Database.pm is a dummy database interface with a 'database' of three records: record1, record2 and record3.
Metadata for record1 and record2 is available in DC format; metadata for record1 is also available in another
format with the metadataPrefix 'wibble'; and record3 is a 'deleted' record so no metadata is available.

4.2 Identify



This verb takes no arguments and returns information about a repository §4.2. The example code
implements Identify by simply writing out information from configuration variables. The protocol allows for
additional <description> blocks which may contain community-specific information. Examples include
<oai-identifier> which specifies a particular identifier syntax, and <eprints> which includes additional
information appropriate for the e-print community §A2.

4.3 ListSets 

This verb takes no arguments and returns the set structure of the repository §4.6. The example code does 
not implement any sets so the response is an empty list. 

4.4 ListMetadataFormats 

This verb may be used either with an identifier argument or without any arguments §4.4. If an identifier is
specified then the verb returns the metadata formats available for that record. In many cases a repository
may be able to disseminate metadata for all records in the same format or formats. In this case the response
will be the same if there is no identifier argument or if the identifier argument specifies any record that
exists. The example code implements the general case by calling a routine in the Database.pm module to ask
what formats are available, and then formats the reply appropriately. For each metadata format, the reply
must include a <metadataPrefix> (used to specify that format in other requests), and a <schema> URL. A
<metadataNamespace> element may optionally be returned but is not implemented in the example code.

An example request and response is: 

    simeon@fff>./oai1.pl -r 'verb=ListMetadataFormats&identifier=record1'
    Content-Type: text/xml

    <?xml version="1.0" encoding="UTF-8"?>
    <ListMetadataFormats xmlns="http://www.openarchives.org/OAI/OAI_ListMetadataFormats"
        xsi:schemaLocation="http://www.openarchives.org/OAI/1.0/OAI_ListMetadataFormats
                            http://www.openarchives.org/OAI/1.0/OAI_ListMetadataFormats.xsd">
      <responseDate>2001-05-05T12:27:36-06:00</responseDate>
      <requestURL>http://localhost/oai1?verb=ListMetadataFormats&amp;identifier=record1&amp;verb=ListMetadataFormats</requestURL>
      <metadataFormat>
        <metadataPrefix>wibble</metadataPrefix>
        <schema>http://wibble.org/wibble.xsd</schema>
      </metadataFormat>
      <metadataFormat>
        <metadataPrefix>oai_dc</metadataPrefix>
        <schema>http://www.openarchives.org/OAI/dc.xsd</schema>
      </metadataFormat>
    </ListMetadataFormats>

The response indicates that the record record1 may be disseminated in either oai_dc or wibble formats.

4.5 GetRecord 

This verb requests metadata for a particular record in a particular format §4.1. The example code implements
this as a call to a subroutine disseminate (shared with ListRecords) after checking that the record exists.
The record returned consists of two parts if the record is not deleted: a <header> block which contains
the identifier and the datestamp (the information required for harvesting) and a <metadata> block which
contains the XML metadata record in the requested format. The <metadata> block will be missing if the
record is deleted or if the requested metadata format is not available.

For example, a request for oai_dc for record2 would be: 

    simeon@fff>./oai1.pl -r 'verb=GetRecord&identifier=record2&metadataPrefix=oai_dc'
    Content-Type: text/xml

    <?xml version="1.0" encoding="UTF-8"?>
    <GetRecord xmlns="http://www.openarchives.org/OAI/OAI_GetRecord"
        xsi:schemaLocation="http://www.openarchives.org/OAI/1.0/OAI_GetRecord
                            http://www.openarchives.org/OAI/1.0/OAI_GetRecord.xsd">
      <responseDate>2001-05-05T12:50:23-06:00</responseDate>
      <requestURL>http://localhost/oai1?verb=GetRecord&amp;identifier=record2&amp;metadataPrefix=oai_dc&amp;verb=GetRecord</requestURL>
      <record>
        <header>
          <identifier>record2</identifier>
          <datestamp>1999-02-12</datestamp>
        </header>
        <metadata>
          <oai_dc xsi:schemaLocation="http://purl.org/dc/elements/1.1/
                                      http://www.openarchives.org/OAI/dc.xsd"
                  xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
                  xmlns="http://purl.org/dc/elements/1.1/">
            <title>Item 2</title>
            <creator>A N Other</creator>
          </oai_dc>
        </metadata>
      </record>
    </GetRecord>

but a request for the unavailable format wibble would be: 

    simeon@fff>./oai1.pl -r 'verb=GetRecord&identifier=record2&metadataPrefix=wibble'
    Content-Type: text/xml

    <?xml version="1.0" encoding="UTF-8"?>
    <GetRecord xmlns="http://www.openarchives.org/OAI/OAI_GetRecord"
        xsi:schemaLocation="http://www.openarchives.org/OAI/1.0/OAI_GetRecord
                            http://www.openarchives.org/OAI/1.0/OAI_GetRecord.xsd">
      <responseDate>2001-05-05T12:52:13-06:00</responseDate>
      <requestURL>http://localhost/oai1?verb=GetRecord&amp;identifier=record2&amp;metadataPrefix=wibble&amp;verb=GetRecord</requestURL>
      <record>
        <header>
          <identifier>record2</identifier>
          <datestamp>1999-02-12</datestamp>
        </header>
      </record>
    </GetRecord>

which includes a <header> block but no <metadata> block.

The protocol also permits the addition of an <about> container §2.2 for each record. This is provided as a
hook for additional information such as rights or terms information. It is not currently used by any of the
registered OAI data-providers and is not implemented in the example code.

4.6 ListIdentifiers and ListRecords

ListIdentifiers §4.3 and ListRecords §4.5 both implement a search by date; the difference is whether they
return a list of identifiers or complete metadata records in the specified format. The example code implements
both of these verbs using the subroutine listEither which calls a search by date (getIdsByDate) in Database.pm.

In the case of ListIdentifiers the output consists of <identifier> elements which may include the attribute
status="deleted" if the record is deleted. An example request without date restriction is:

    simeon@fff>./oai1.pl -r 'verb=ListIdentifiers'
    Content-Type: text/xml

    <?xml version="1.0" encoding="UTF-8"?>
    <ListIdentifiers xmlns="http://www.openarchives.org/OAI/OAI_ListIdentifiers"
        xsi:schemaLocation="http://www.openarchives.org/OAI/1.0/OAI_ListIdentifiers
                            http://www.openarchives.org/OAI/1.0/OAI_ListIdentifiers.xsd">
      <responseDate>2001-05-05T12:59:30-06:00</responseDate>
      <requestURL>http://localhost/oai1?verb=ListIdentifiers&amp;verb=ListIdentifiers</requestURL>
      <identifier>record1</identifier>
      <identifier>record2</identifier>
      <identifier status="deleted">record3</identifier>
    </ListIdentifiers>

The response lists the identifiers of the three records in the repository and indicates that record3 is deleted.
If the parameter until=2000-01-01 were added then only the first two identifiers would be returned since the
datestamp of record3 is 2000-03-13.

In the case of ListRecords the output consists of <record> blocks similar to those obtained from GetRecord
requests. ListRecords requests must include a metadataPrefix parameter.
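
The search by date can be sketched in a few lines. The following is an illustration only and is not the getIdsByDate routine from Database.pm: the subroutine name get_ids_by_date is hypothetical, the layout of %records (identifier mapped to datestamp and metadata, with undef metadata marking a deleted record) is an assumption, and the datestamp given for record1 is invented for the example. Because datestamps are in YYYY-MM-DD format, plain string comparison gives the correct date ordering.

```perl
use strict;
use warnings;

# Dummy 'database': identifier => [ datestamp, metadata ].
# undef metadata marks a deleted record. The datestamp of record1
# is an assumption made for this sketch.
my %records = (
    record1 => [ '1999-01-01', {} ],
    record2 => [ '1999-02-12', {} ],
    record3 => [ '2000-03-13', undef ],    # deleted
);

# Return the identifiers whose datestamp lies in the inclusive range
# [from, until]. Either bound may be undef, meaning "no restriction".
sub get_ids_by_date {
    my ($from, $until) = @_;
    return grep {
        my $date = $records{$_}[0];
        (!defined $from  || $date ge $from) &&
        (!defined $until || $date le $until)
    } sort keys %records;
}

print join(', ', get_ids_by_date(undef, '2000-01-01')), "\n";
```

With until=2000-01-01 this selects record1 and record2 only, in line with the behaviour described above.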

4.7 Partial responses 

The OAIMH protocol allows for partial responses §3.4 for all of the list verbs (ListIdentifiers, ListSets,
ListMetadataFormats and ListRecords). This feature has been implemented by most of the larger registered OAI
repositories for the ListIdentifiers and ListRecords verbs. The example code does not implement this feature.

5 Harvesting metadata 

To harvest metadata within the OAI, one must implement the service-provider side of the OAIMH protocol.
I will consider the implementation of a harvester that performs two functions: firstly, harvest all metadata
in a particular format, and secondly, harvest all metadata in a particular format that has changed since a
given date. These functions are the basis of a system that can create and maintain an up-to-date copy of
the metadata from an OAI-compliant repository.

As one of the maintainers of a heavily used archive, I am painfully aware of the importance of avoiding
inadvertent denial-of-service attacks created by badly written harvesting software. Automated agents should
always include a useful user-agent string and a valid e-mail contact address in their HTTP requests. The
flow-control elements of the protocol must be respected, and careful testing is essential.

5.1 Detecting changes that require manual intervention 

I will assume that the goal is to create software which will run on some schedule so that the local copy of 
metadata from some set of repositories is kept updated without manual intervention. However, it would be 
reckless to assume that the details of repositories will not change over time. In order to avoid the need for 
manual polling to detect such changes, we should ask how they can be detected automatically. 

To detect changes other than the addition and deletion of records, which are part of normal repository
operation, one can compare the responses to OAI requests that describe the repository between successive
harvests. These requests are Identify and probably ListSets and ListMetadataFormats (for the whole repository
as opposed to any single record). For all requests we expect the <responseDate> to change with each
request, but for these requests we expect the rest of the response to be unchanged. Note that to do the
test correctly one should compare the XML data in such a way that valid transformations, say re-ordering
elements, are ignored. However, in practice it is likely to be sufficient (if over-sensitive) to do a string
comparison of the responses so long as changes in the <responseDate> are ignored.

In the example harvester I have included the facility to specify a file containing the Identify response from 
the previous harvest. This is used both to extract the date of the last harvest and to check for changes in 
that response. I have not implemented a test for changes in the ListSets and ListMetadataFormats responses. 
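
The string comparison described above can be sketched as follows. This is an illustration and not the code used in the example harvester: the subroutine names strip_response_date and responses_unchanged are hypothetical, and the sketch assumes that each response contains at most one <responseDate> element with no attributes. The sample responses are abbreviated and use an arbitrary repository name.

```perl
use strict;
use warnings;

# Blank out the content of the <responseDate> element so that it
# does not take part in the comparison.
sub strip_response_date {
    my ($xml) = @_;
    $xml =~ s{<responseDate>[^<]*</responseDate>}{<responseDate/>};
    return $xml;
}

# True if two responses are identical apart from their <responseDate>.
sub responses_unchanged {
    my ($previous, $current) = @_;
    return strip_response_date($previous) eq strip_response_date($current);
}

my $previous = '<Identify><responseDate>2001-06-05T10:00:00-06:00</responseDate>'
             . '<repositoryName>Test</repositoryName></Identify>';
my $current  = '<Identify><responseDate>2001-06-06T10:00:00-06:00</responseDate>'
             . '<repositoryName>Test</repositoryName></Identify>';
print responses_unchanged($previous, $current) ? "unchanged\n" : "changed\n";
```

A difference reported by such a test may be a harmless re-ordering or reformatting, which is the over-sensitivity mentioned above; it is a prompt for a person to look, not proof of a significant change.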

5.2 Incremental harvesting 

The OAIMH protocol was designed to facilitate incremental harvesting. The idea is that a service-provider 
will maintain an up to date copy of the metadata from a repository by periodically harvesting 'changed' 
records. This is why all records have a datestamp, the date of last modification, associated with them. 

The 1 day granularity of the datestamp and the possibility of data-providers and service-providers being
in different time zones means that there must be some overlap between the date ranges of successive
requests [10]. If the service-provider and data-provider share the same time zone then a 1 day overlap is
sufficient to ensure that updates are not missed; records might be updated after the harvest on the day of
the last harvest, but provided records that have changed on that day are reharvested then no changes will
be missed. To cope with different time zones it is necessary to extend this to a 2 day overlap if the harvester
works with dates local to itself. An alternative strategy, which I prefer, is to use only the dates returned by
the repository and thus, by working in the local time zone of the repository, reduce the required overlap to
1 day.

In the example harvester I implement this last strategy by taking the date of the last harvest from the
<responseDate> of the stored Identify response (the <responseDate> must be specified in the local time zone
of the repository §3.1.2.1). This date may then be used as the from date (inclusive) for the next ListRecords
or ListIdentifiers request.
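
A minimal sketch of extracting the from date is given below. It is not the code of the example harvester: the subroutine name from_date_of is hypothetical, and a regular expression is used in place of a real XML parser purely to keep the sketch short. It assumes a <responseDate> of the form shown in the example responses in section 4, i.e. a date followed by 'T' and a time.

```perl
use strict;
use warnings;

# Return the YYYY-MM-DD part of the <responseDate> in a stored
# response, or undef if no <responseDate> of the expected form is found.
# The date is taken as written, i.e. in the repository's own time zone.
sub from_date_of {
    my ($xml) = @_;
    if ($xml =~ m{<responseDate>\s*(\d{4}-\d{2}-\d{2})T[^<]*</responseDate>}) {
        return $1;
    }
    return undef;
}

my $stored = '<Identify><responseDate>2001-06-05T12:27:36-06:00</responseDate></Identify>';
my $from = from_date_of($stored);
print defined $from ? "from=$from\n" : "no previous harvest date, doing complete harvest\n";
```

Because the from date is inclusive, using the date of the previous harvest unchanged gives the 1 day overlap discussed above.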

5.3 Flow control and redirection 

The module OAIGet.pm examines the HTTP reply for status codes 302 (redirect) and 503 (retry-after). Both
replies are handled automatically; a default retry period is assumed if the 503 response does not specify a
time (though this is an error on the part of the data-provider). Messages are printed if the verbose option
is selected.
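
The decision logic can be sketched independently of any HTTP library. The following is an illustration and not the code of OAIGet.pm: the subroutine name next_action, the action names and the shape of the $headers hash (header name mapped to value) are assumptions. The sketch understands only a Retry-After value given as a whole number of seconds; any other value, including an HTTP date, falls back to the default wait.

```perl
use strict;
use warnings;

# Decide what a harvester should do with an HTTP reply.
# Returns ('retry', seconds), ('redirect', url), ('done') or ('fail').
sub next_action {
    my ($status, $headers, $default_wait) = @_;
    if ($status == 503) {
        my $wait = $headers->{'Retry-After'};
        # Fall back to the default if the header is missing or is
        # not a plain number of seconds.
        $wait = $default_wait unless defined $wait && $wait =~ /^\d+$/;
        return ('retry', $wait);
    }
    if ($status == 302) {
        return defined $headers->{'Location'}
            ? ('redirect', $headers->{'Location'})
            : ('fail');
    }
    return ('done') if $status == 200;
    return ('fail');
}

my ($action, $detail) = next_action(503, { 'Retry-After' => '60' }, 600);
print "$action $detail\n";
```

A harvester would call such a routine in a loop, sleeping for the given number of seconds on 'retry' and re-issuing the request to the new URL on 'redirect', ideally with an upper limit on the number of attempts.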

5.4 Parsing replies



OAIMH protocol replies are designed to be self-contained, in part to allow off-line processing, thereby
separating the harvesting and database-update processes. However, in order to deal with partial responses
the harvesting software must be able to parse the responses to all the list requests (ListIdentifiers, ListSets,
ListMetadataFormats and ListRecords) sufficiently to extract any resumptionToken §3.4. To date, none of the
registered OAI-compliant repositories give partial responses for ListSets and ListMetadataFormats requests,
but several do for ListIdentifiers and ListRecords requests.

Perhaps the neatest way to implement a harvester would be to have it recombine partial responses into
a complete reply. The example code does not do this but does parse the responses to all list requests to
look for a <resumptionToken> so that further requests can be used to complete the original request.

5.5 An example harvester

The files oaiharvest.pl, OAIGet.pm and OAIParser.pm implement a simple harvester that illustrates the points
mentioned above. oaiharvest.pl is the executable and accepts a variety of flags; these can be displayed by
executing oaiharvest.pl -h. The algorithm is:

    read command line arguments
    check options and parameters
    issue Identify request
    compare response with previous Identify response if given
    extract 'from' date from command line, previous Identify response or do complete harvest
    LOOP:
        issue ListRecords or ListIdentifiers request
        check for resumptionToken, LOOP if present

The subroutine OAIGet in OAIGet.pm is used to issue the OAIMH requests and this handles any retry-after
or redirect replies. XML parsing is handled by the OAIParser.pm module which extends XML-Parser, which
itself is based on the expat parser.
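
The LOOP step of the algorithm can be sketched as below. This is not the code of oaiharvest.pl: the subroutine name harvest_all is hypothetical, the HTTP request is represented by a caller-supplied code reference (an assumption that keeps the sketch free of network access), and a regular expression stands in for the XML parser. Follow-up requests carry only the verb and the resumptionToken, which matches the requests shown in the harvester output later in this section.

```perl
use strict;
use warnings;

# Issue a list request and keep following resumptionTokens until a
# response arrives without one. $fetch is a code reference that takes
# the request arguments as a hash and returns the response XML.
# Returns the list of (partial) responses in the order received.
sub harvest_all {
    my ($fetch, $verb, %args) = @_;
    my @responses;
    my %request = (verb => $verb, %args);
    while (1) {
        my $xml = $fetch->(%request);
        last unless defined $xml;
        push @responses, $xml;
        my ($token) = $xml =~ m{<resumptionToken>([^<]+)</resumptionToken>};
        last unless defined $token;
        # Continue the same request: verb plus resumptionToken only.
        %request = (verb => $verb, resumptionToken => $token);
    }
    return @responses;
}

# Stand-in for a real data-provider: two canned partial responses.
my @canned = (
    '<ListIdentifiers><identifier>a</identifier>'
        . '<resumptionToken>t1</resumptionToken></ListIdentifiers>',
    '<ListIdentifiers><identifier>b</identifier></ListIdentifiers>',
);
my @responses = harvest_all(sub { return shift @canned }, 'ListIdentifiers');
print scalar(@responses), " responses received\n";
```

Passing the fetch routine as an argument also makes the loop easy to test without contacting a registered repository.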

Let us take as an example harvesting metadata from the example data-provider code, which has been set up
at the URL http://localhost/oai1. First we would issue a harvest command without any time restriction (to
harvest all records). In the examples, I harvest just the identifiers using ListIdentifiers requests; the flags -r
and -m metadataPrefix can be used to instruct oaiharvest.pl to issue ListRecords requests and to specify a
metadataPrefix other than oai_dc.

    simeon@fff>mkdir harvest1
    simeon@fff>./oaiharvest.pl -d harvest1 http://localhost/oai1
    oaiharvest.pl: Harvest from http://localhost/oai1 using POST
    OAIGet: Doing POST to http://localhost/oai1 args: verb=Identify
    OAIGet: Got 200 OK (479bytes)
    oaiharvest.pl: Doing complete harvest.
    OAIGet: Doing POST to http://localhost/oai1 args: verb=ListIdentifiers
    OAIGet: Got 200 OK (537bytes)
    oaiharvest.pl: Got 3 identifiers (running total: 3)
    oaiharvest.pl: No resumptionToken, request complete.
    oaiharvest.pl: Done.
    simeon@fff>ls harvest1
    Identify  ListIdentifiers.1

If we then do an incremental harvest specifying the file name of the last Identify response, harvest1/Identify,
the harvester checks against this response for changes (none except date) and extracts the date of the last
harvest (2001-06-05) to be used as the from date for the new harvest.

    simeon@fff>mkdir harvest2
    simeon@fff>./oaiharvest.pl -d harvest2 -i harvest1/Identify http://localhost/oai1
    oaiharvest.pl: Harvest from http://localhost/oai1 using POST
    OAIGet: Doing POST to http://localhost/oai1 args: verb=Identify
    OAIGet: Got 200 OK (479bytes)
    oaiharvest.pl: Identify response unchanged from reference (except date)
    oaiharvest.pl: Reading harvest1/Identify to get from date
    oaiharvest.pl: Incremental harvest from 2001-06-05 (from harvest1/Identify)
    OAIGet: Doing POST to http://localhost/oai1 args: from=2001-06-05&verb=ListIdentifiers
    OAIGet: Got 200 OK (444bytes)
    oaiharvest.pl: Got 0 identifiers (running total: 0)
    oaiharvest.pl: No resumptionToken, request complete.
    oaiharvest.pl: Done.

Since there have been no changes in the database this harvest results in no identifiers being returned. 

To extend this example, I then edited the database (Database.pm) to add a new record (record4) with
datestamp 2001-06-05 which simulates the addition of a record after the last harvest but on the same day.
I then ran another harvest command.

    simeon@fff>diff Database.pm~ Database.pm
    24c24,26
    <   'record3' => [ '2000-03-13', undef ] #deleted
    ---
    >   'record3' => [ '2000-03-13', undef ], #deleted
    >   'record4' => [ '2001-06-05', {
    >       'oai_dc' => ['title', 'Item 4', 'creator', 'Someone Else'] } ]
    simeon@fff>mkdir harvest3
    simeon@fff>./oaiharvest.pl -d harvest3 -i harvest2/Identify http://localhost/oai1
    oaiharvest.pl: Harvest from http://localhost/oai1 using POST
    OAIGet: Doing POST to http://localhost/oai1 args: verb=Identify
    OAIGet: Got 200 OK (479bytes)
    oaiharvest.pl: Identify response unchanged from reference (except date)
    oaiharvest.pl: Reading harvest2/Identify to get from date
    oaiharvest.pl: Incremental harvest from 2001-06-05 (from harvest2/Identify)
    OAIGet: Doing POST to http://localhost/oai1 args: from=2001-06-05&verb=ListIdentifiers
    OAIGet: Got 200 OK (478bytes)
    oaiharvest.pl: Got 1 identifiers (running total: 1)
    oaiharvest.pl: No resumptionToken, request complete.
    oaiharvest.pl: Done.

This harvest results in one additional identifier, record4, being returned as expected. 

Below are two excerpts from harvests from real repositories which illustrate the flow-control features of the
protocol. The first is from arXiv, which uses 503 retry-after replies to enforce a delay between requests. The
second is from NACA, which uses 302 redirect replies to demonstrate a load-sharing scheme.

    OAIGet: Doing POST to http://arXiv.org/oai1 args: verb=ListIdentifiers
    OAIGet: Got 503, sleeping for 60 seconds...
    OAIGet: Woken again, retrying...
    OAIGet: Got 200 OK (27398bytes)
    oaiharvest.pl: Got 502 identifiers (running total: 502)
    oaiharvest.pl: Got resumptionToken: '1997-02-10 '
    OAIGet: Doing POST to http://arXiv.org/oai1 args: resumptionToken=1997-02-10 &verb=ListIdentifiers
    OAIGet: Got 503, sleeping for 60 seconds...
    OAIGet: Woken again, retrying...
    OAIGet: Got 200 OK (28330bytes)
    oaiharvest.pl: Got 520 identifiers (running total: 1022)
    oaiharvest.pl: Got resumptionToken: '1997-03-06 '

    OAIGet: Doing POST to http://naca.larc.nasa.gov/oai/ args: verb=ListIdentifiers
    OAIGet: Got 302, redirecting to http://buckets.dsi.internet2.edu/naca/oai/?...
    OAIGet: Doing POST to http://buckets.dsi.internet2.edu/naca/oai/ args: verb=ListIdentifiers
    OAIGet: Got 200 OK (336705bytes)
    oaiharvest.pl: Got 6352 identifiers (running total: 6352)



I hope the examples above provide a useful demonstration of some of the features of OAIMH metadata
harvesting. Be sure to exercise caution and restraint when running tests against registered repositories.
There is some cost associated with answering OAIMH requests, and recklessly downloading large amounts
of data for no good reason is not helpful.

6 Conclusions 

The OAIMH protocol has been public for 5 months now and experience shows that it is adequate for its
intended purpose. There are now 30 registered repositories which together expose over 600,000 metadata
records. While there are currently just two registered service providers, 'arc' [11] and the Repository
Explorer [12], there is an increasing number of tools and libraries available to assist in the development of
harvesting applications. Publicly available tools and libraries are listed on the OAI web site [13]. This
includes Tim Brody's [14] Perl library which is considerably more extensive than the examples presented
here.

The uptake of OAI is very encouraging and it is feedback from the current implementers which will shape 
the next version of the OAIMH protocol. Anyone implementing, or interested in implementing, either side 
of the OAIMH protocol should subscribe to the oai-implementers [15] mailing list. It is a helpful and friendly 
forum. 



Appendix: Example programs 

The example programs are: 

• oai1.pl and OAIServer.pm for the server; and

• oaiharvest.pl, OAIGet.pm and OAIParser.pm for the harvester.

These files are included with this paper; please download the source.

In order to run the example programs, you will require Perl 5.004 or later and the following modules (the
precise version I used is given in parentheses). For the server:

• XML-Writer (XML-Writer-0.4)

and for the harvester: 

• MIME-Base64 (MIME-Base64-2.11) 

• URI (URI-1.09) 

• HTML-Tagset (HTML-Tagset-3.02) 

• HTML-Parser (HTML-Parser-3.11) 

• libnet (libnet-1.0703) 

• Digest::MD5 (Digest-MD5-2.11) 

• LWP (libwww-perl-5.48)

• expat library (expat-1.95.1)

• XML-Parser (XML-Parser-2.30) 

All of the above except for expat are available from CPAN (http://www.cpan.org/) and can be installed with
the standard perl Makefile.PL; make; make test; make install sequence. There should not be any dependency
problems if the modules are installed in the order listed. The expat XML parsing library, upon which
XML-Parser relies, is available from SourceForge (http://sourceforge.net/projects/expat/).

Before running oaiharvest.pl you should first edit the line that defines the variable $contact and insert your
e-mail address. This will then be specified as the contact address for all HTTP requests and will enable
the server maintainer to contact you if there are problems. The example code has been tested only on a
Linux system and with the Apache server. While I hope that it will work on other systems this has not been
verified.